{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "72867ba9-6a8c-4b14-acbf-487ea0a61836",
      "metadata": {
        "id": "72867ba9-6a8c-4b14-acbf-487ea0a61836"
      },
      "source": [
        "# Multilingual Text-to-Image Search with MultilingualCLIP\n",
        "\n",
        "<a href=\"https://colab.research.google.com/drive/10Wldbu0Zugj7NmQyZwZzuorZ6SSAhtIo\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"></a>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f576573b-a48f-4790-817d-e99f8bd28fd0",
      "metadata": {
        "id": "f576573b-a48f-4790-817d-e99f8bd28fd0"
      },
      "source": [
        "Most text-image models are only able to provide embeddings for text in a single language, typically English. Multilingual CLIP models, however, are models that have been trained on multiple different languages. This allows the model to produce similar embeddings for the same sentence in multiple different languages.  \n",
        "\n",
        "This guide will show you how to finetune a multilingual CLIP model for a text to image retrieval task in non-English languages.\n",
        "\n",
        "*Note, Check the runtime menu to be sure you are using a GPU/TPU instance, or this code will run very slowly.*\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ed1e7d55-a458-4dfd-8f4c-eeb02521c221",
      "metadata": {
        "id": "ed1e7d55-a458-4dfd-8f4c-eeb02521c221"
      },
      "source": [
        "## Install"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9261d0a7-ad6d-461f-bdf7-54e9804cc45d",
      "metadata": {
        "id": "9261d0a7-ad6d-461f-bdf7-54e9804cc45d"
      },
      "outputs": [],
      "source": [
        "!pip install 'finetuner[full]'"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "11f13ad8-e0a7-4ba6-b52b-f85dd221db0f",
      "metadata": {
        "id": "11f13ad8-e0a7-4ba6-b52b-f85dd221db0f"
      },
      "source": [
        "## Task"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "ed1f88d4-f140-48d4-9d20-00e628c73e38",
      "metadata": {
        "id": "ed1f88d4-f140-48d4-9d20-00e628c73e38"
      },
      "source": [
        "We'll be fine-tuning multilingual CLIP on the electronics section of the [German Fashion12k dataset](https://github.com/Toloka/Fashion12K_german_queries), which contains images and descriptions of fashion products in German.\n",
        "\n",
        "The images are a subset of the [xthan/fashion-200k dataset](https://github.com/xthan/fashion-200k), and we have commissioned their human annotations via crowdsourcing platform. Annotations were made in two steps.  First, we passed the 12,000 images to annotators in their large international user community, who added descriptive captions.\n",
        "\n",
        "Each product in the dataset contains several attributes, we will be making use of the image and captions to create a [`Document`](https://finetuner.jina.ai/walkthrough/create-training-data/#preparing-a-documentarray) containing two chunks, one containing the image and another containing the category of the product."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2a40f0b1-7272-4ae6-9d0a-f5c8d6d534d8",
      "metadata": {
        "id": "2a40f0b1-7272-4ae6-9d0a-f5c8d6d534d8"
      },
      "source": [
        "## Data\n",
        "We will use the `DE-Fashion-Image-Text-Multimodal-train` dataset, which we have already pre-processed and made available on the Jina AI Cloud. You can access it using `DocArray.pull`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4420a4ac-531a-4db3-af75-ebb58d8f828b",
      "metadata": {
        "id": "4420a4ac-531a-4db3-af75-ebb58d8f828b"
      },
      "outputs": [],
      "source": [
        "import finetuner\n",
        "from finetuner import DocumentArray, Document\n",
        "\n",
        "finetuner.login(force=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "bab5c3fb-ee75-4818-bd18-23c7a5983e1b",
      "metadata": {
        "id": "bab5c3fb-ee75-4818-bd18-23c7a5983e1b"
      },
      "outputs": [],
      "source": [
        "train_data = DocumentArray.pull('finetuner/DE-Fashion-Image-Text-Multimodal-train', show_progress=True)\n",
        "eval_data = DocumentArray.pull('finetuner/DE-Fashion-Image-Text-Multimodal-test', show_progress=True)\n",
        "\n",
        "query_data = DocumentArray.pull('finetuner/DE-Fashion-Image-Text-Multimodal-query', show_progress=True)\n",
        "index_data = DocumentArray.pull('finetuner/DE-Fashion-Image-Text-Multimodal-index', show_progress=True)\n",
        "\n",
        "train_data.summary()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3b859e9c-99e0-484b-98d5-643ad51de8f0",
      "metadata": {
        "id": "3b859e9c-99e0-484b-98d5-643ad51de8f0"
      },
      "source": [
        "## Backbone Model\n",
        "Currently, we only support one multilingual CLIP model. This model is the `xlm-roberta-base-ViT-B-32` from [open-clip](https://github.com/mlfoundations/open_clip), which has been trained on the [`laion5b` dataset](https://github.com/LAION-AI/laion5B-paper)."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0b57559c-aa55-40ff-9d05-f061dfb01354",
      "metadata": {
        "id": "0b57559c-aa55-40ff-9d05-f061dfb01354"
      },
      "source": [
        "## Fine-tuning\n",
        "Now that our data has been prepared, we can start our fine-tuning run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a0cba20d-e335-43e0-8936-d926568034b3",
      "metadata": {
        "id": "a0cba20d-e335-43e0-8936-d926568034b3"
      },
      "outputs": [],
      "source": [
        "from finetuner.callback import EvaluationCallback, WandBLogger\n",
        "\n",
        "run = finetuner.fit(\n",
        "    model='clip-base-multi',\n",
        "    train_data='finetuner/DE-Fashion-Image-Text-Multimodal-train',\n",
        "    epochs=5,\n",
        "    learning_rate=1e-6,\n",
        "    loss='CLIPLoss',\n",
        "    device='cuda',\n",
        "    callbacks=[\n",
        "        EvaluationCallback(\n",
        "            query_data='finetuner/DE-Fashion-Image-Text-Multimodal-query',\n",
        "            index_data='finetuner/DE-Fashion-Image-Text-Multimodal-index',\n",
        "            model='clip-text',\n",
        "            index_model='clip-vision'\n",
        "        ),\n",
        "        WandBLogger(),\n",
        "    ]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6be36da7-452b-4450-a5d5-6cae84522bb5",
      "metadata": {
        "id": "6be36da7-452b-4450-a5d5-6cae84522bb5"
      },
      "source": [
        "Let's understand what this piece of code does:\n",
        "\n",
        "* We start with providing `model`, names of training and evaluation data.\n",
        "* We also provide some hyper-parameters such as number of `epochs` and a `learning_rate`.\n",
        "* We use `CLIPLoss` to optimize the CLIP model.\n",
        "* We use `finetuner.callback.EvaluationCallback` for evaluation.\n",
        "* We then use the `finetuner.callback.WandBLogger` to display our results."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "923e4206-ac60-4a75-bb3d-4acfc4218cea",
      "metadata": {
        "id": "923e4206-ac60-4a75-bb3d-4acfc4218cea"
      },
      "source": [
        "## Monitoring\n",
        "\n",
        "Now that we've created a run, let's see its status. You can monitor the run by checking the status - `run.status()` - and the logs - `run.logs()` or `run.stream_logs()`. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "56d020bf-8095-4a83-a532-9b6c296e985a",
      "metadata": {
        "id": "56d020bf-8095-4a83-a532-9b6c296e985a",
        "scrolled": true,
        "tags": []
      },
      "outputs": [],
      "source": [
        "# note, the fine-tuning might takes 20~ minutes\n",
        "for entry in run.stream_logs():\n",
        "    print(entry)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "b58930f1-d9f5-43d3-b852-5cbaa04cb1aa",
      "metadata": {
        "id": "b58930f1-d9f5-43d3-b852-5cbaa04cb1aa"
      },
      "source": [
        "Since some runs might take up to several hours/days, it's important to know how to reconnect to Finetuner and retrieve your run.\n",
        "\n",
        "```python\n",
        "import finetuner\n",
        "\n",
        "finetuner.login()\n",
        "run = finetuner.get_run(run.name)\n",
        "```\n",
        "\n",
        "You can continue monitoring the run by checking the status - `finetuner.run.Run.status()` or the logs `finetuner.run.Run.logs()`."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f0b81ec1-2e02-472f-b2f4-27085bb041cc",
      "metadata": {
        "id": "f0b81ec1-2e02-472f-b2f4-27085bb041cc"
      },
      "source": [
        "## Evaluating\n",
        "Once the run is finished, the metrics are calculated by the {class}`~finetuner.callback.EvaluationCallback` and plotted using the {class}`~finetuner.callback.WandBLogger` callback. These plots can be accessed using the link provided in the logs once finetuning starts:\n",
        "\n",
        "```bash\n",
        "           INFO     Finetuning ... \n",
        "wandb: Currently logged in as: anony-mouse-448424. Use `wandb login --relogin` to force relogin\n",
        "wandb: Tracking run with wandb version 0.13.5\n",
        "wandb: Run data is saved locally in <path-to-file>\n",
        "wandb: Run `wandb offline` to turn off syncing.\n",
        "wandb: Syncing run ancient-galaxy-2\n",
        "wandb:  View project at <link-to-project>\n",
        "wandb:  View run at <link-to-run>\n",
        "[07:48:21] INFO     Done ✨                                                                              __main__.py:195\n",
        "           DEBUG    Finetuning took 0 days, 0 hours 8 minutes and 19 seconds                             __main__.py:197\n",
        "           DEBUG    Metric: 'clip-text-to-clip-vision_precision_at_k' Value: 0.04035                     __main__.py:206\n",
        "           DEBUG    Metric: 'clip-text-to-clip-vision_hit_at_k' Value: 0.79200                           __main__.py:206\n",
        "           DEBUG    Metric: 'clip-text-to-clip-vision_average_precision' Value: 0.41681                  __main__.py:206\n",
        "           DEBUG    Metric: 'clip-text-to-clip-vision_reciprocal_rank' Value: 0.41773                    __main__.py:206\n",
        "           DEBUG    Metric: 'clip-text-to-clip-vision_dcg_at_k' Value: 0.57113                           __main__.py:206\n",
        "           INFO     Building the artifact ...                                                            __main__.py:208\n",
        "           INFO     Pushing artifact to Jina AI Cloud ...                                                __main__.py:234\n",
        "[08:02:33] INFO     Artifact pushed under ID '63b52b5b3278416c15353bf3'                                  __main__.py:236\n",
        "           DEBUG    Artifact size is 2599.190 MB                                                         __main__.py:238\n",
        "           INFO     Finished 🚀                                                                          __main__.py:239\n",
        "```\n",
        "\n",
        "The generated plots should look like this:\n",
        "\n",
        "![WandB-mclip](https://user-images.githubusercontent.com/6599259/212645881-20071aba-8643-4878-bc53-97eb6f766bf0.png)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2b8da34d-4c14-424a-bae5-6770f40a0721",
      "metadata": {
        "id": "2b8da34d-4c14-424a-bae5-6770f40a0721"
      },
      "source": [
        "## Saving\n",
        "\n",
        "After the run has finished successfully, you can download the tuned model on your local machine:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0476c03f-838a-4589-835c-60d1b7f3f893",
      "metadata": {
        "id": "0476c03f-838a-4589-835c-60d1b7f3f893"
      },
      "outputs": [],
      "source": [
        "artifact = run.save_artifact('mclip-model')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "baabd6be-8660-47cc-a48d-feb43d0a507b",
      "metadata": {
        "id": "baabd6be-8660-47cc-a48d-feb43d0a507b"
      },
      "source": [
        "## Inference\n",
        "\n",
        "Now you saved the `artifact` into your host machine,\n",
        "let's use the fine-tuned model to encode a new `Document`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fe43402f-4191-4343-905c-c75c64694662",
      "metadata": {
        "id": "fe43402f-4191-4343-905c-c75c64694662"
      },
      "outputs": [],
      "source": [
        "text_da = DocumentArray([Document(text='setwas Text zum Codieren')])\n",
        "image_da = DocumentArray([Document(uri='https://upload.wikimedia.org/wikipedia/commons/4/4e/Single_apple.png')])\n",
        "\n",
        "mclip_text_encoder = finetuner.get_model(artifact=artifact, select_model='clip-text')\n",
        "mclip_image_encoder = finetuner.get_model(artifact=artifact, select_model='clip-vision')\n",
        "\n",
        "finetuner.encode(model=mclip_text_encoder, data=text_da)\n",
        "finetuner.encode(model=mclip_image_encoder, data=image_da)\n",
        "\n",
        "print(text_da.embeddings.shape)\n",
        "print(image_da.embeddings.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ff2e7818-bf11-4179-a34d-d7b790b0db12",
      "metadata": {
        "id": "ff2e7818-bf11-4179-a34d-d7b790b0db12"
      },
      "source": [
        "```bash\n",
        "(1, 512)\n",
        "(1, 512)\n",
        "```\n",
        "\n",
        "```{admonition} what is select_model?\n",
        "When fine-tuning CLIP, we are fine-tuning the CLIPVisionEncoder and CLIPTextEncoder in parallel.\n",
        "The artifact contains two models: `clip-vision` and `clip-text`.\n",
        "The parameter `select_model` tells finetuner which model to use for inference, in the above example,\n",
        "we use `clip-text` to encode a Document with text content.\n",
        "```\n",
        "\n",
        "```{admonition} Inference with ONNX\n",
        "In case you set `to_onnx=True` when calling `finetuner.fit` function,\n",
        "please use `model = finetuner.get_model(artifact, is_onnx=True)`\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "38bc9069-0f0e-47c6-8560-bf77ad200774",
      "metadata": {
        "id": "38bc9069-0f0e-47c6-8560-bf77ad200774"
      },
      "source": [
        "## Before and after\n",
        "We can directly compare the results of our fine-tuned model with an untrained multilingual clip model by displaying the matches each model has for the same query, while the differences between the results of the two models are quite subtle for some queries, the examples below clearly show that fine-tuning increases the quality of the search results:"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e69fdfb2-6482-45fb-9c4d-41e548ef8f06",
      "metadata": {
        "id": "e69fdfb2-6482-45fb-9c4d-41e548ef8f06"
      },
      "source": [
        "```plaintext\n",
        "results for query: \"Spitzen-Midirock Teilfutter Schwarz\" (Lace midi skirt partial lining black) using a zero-shot model and the fine-tuned model\n",
        "```\n",
        "\n",
        "before             |  after\n",
        ":-------------------------:|:-------------------------:\n",
        "![mclip-example-pt-1](https://jina-ai-gmbh.ghost.io/content/images/2022/12/mclip-before.png)  |  ![mclip-example-ft-1](https://jina-ai-gmbh.ghost.io/content/images/2022/12/mclip-after.png)\n",
        "\n",
        "\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.15"
    },
    "vscode": {
      "interpreter": {
        "hash": "9ad9c14fbc5ce15e23594239b0b0bb7cf990b71472055d7d43822c20d61e1cff"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
